Author

Jakub Nowosad

Published

Last Updated: February, 2025

Show the code
spain = sf::read_sf("data/spain.gpkg")
covariates = terra::rast("data/predictors.tif")
temperature = sf::read_sf("data/temp_train.gpkg")

temperature = terra::extract(covariates, temperature, bind = TRUE) |> 
  sf::st_as_sf()

1 RandomForestsGLS

The RandomForestsGLS (https://doi.org/10.21105/joss.03780) package implements the Generalised Least Square (GLS) based Random Forest (RF-GLS) algorithm.1 This approach is designed for spatial data modeling as it accounts for spatial dependencies in the data by:

  1. Using a global dependency-adjusted split criterion and node representatives instead of the classification and regression tree (CART) criterion used in standard RF models
  2. Applying contrast resampling rather than the bootstrap method used in a standard RF model
  3. Employing residual kriging with covariance modeled using a Gaussian process framework

The package provides four functions:

  1. RFGLS_estimate_spatial() for estimation in spatial data
  2. RFGLS_predict() for prediction of the mean function
  3. RFGLS_predict_spatial() for prediction of the spatial response
  4. RFGLS_estimate_timeseries() for estimation in time series data (not discussed here)

The package has rather unintuitive syntax and requires the data to be in a specific format. We need to provide the coordinates of the data (a matrix), the response variable (a vector), and the covariates (a matrix). In the example below, I limited the covariate matrix to the variables that are not spatial proxies.

Show the code
library(RandomForestsGLS)

coords = sf::st_coordinates(temperature)
temp_response = temperature$temp

temperature_df = sf::st_drop_geometry(temperature)
covariate_names = colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
covariate_matrix = as.matrix(temperature_df[, covariate_names])

For the example, we also split the data into training and testing sets based on created random indices.

Show the code
set.seed(2025-01-30)
train_idx = sample(1:nrow(coords), floor(nrow(coords) * 0.7))

The RFGLS_estimate_spatial() function is used to fit the RF-GLS model. Here, we customize the number of trees to 100, but the function has many other parameters that can be adjusted.

Show the code
estimation_result = RFGLS_estimate_spatial(
  coords = coords[train_idx, ],
  y = temp_response[train_idx],
  X = covariate_matrix[train_idx, ],
  ntree = 100
)
str(estimation_result)
List of 7
 $ P_matrix        : int [1:136, 1:100] 18 78 88 123 31 80 49 67 65 98 ...
 $ predicted_matrix: num [1:136, 1:100] 17 17 13 14.9 13 ...
 $ predicted       : num [1:136] 16 15.4 10.9 13.8 15 ...
 $ X               : num [1:136, 1:15] 727 20.7 0 203.7 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:136] "88" "106" "24" "11" ...
  .. ..$ : chr [1:15] "popdens" "coast" "dem" "imd" ...
 $ y               : num [1:136] 16.13 15.21 7.53 13.61 15.59 ...
 $ coords          : num [1:136, 1:2] 495960 523533 273723 618599 616251 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "X" "Y"
 $ RFGLS_object    :List of 6
  ..$ ldaughter : int [1:273, 1:100] 2 4 6 8 0 10 12 0 14 0 ...
  ..$ rdaughter : int [1:273, 1:100] 3 5 7 9 0 11 13 0 15 0 ...
  ..$ nodestatus: int [1:273, 1:100] -3 -3 -3 -3 -1 -3 -3 -1 -3 -1 ...
  ..$ upper     : num [1:273, 1:100] 27.029 0.216 76.038 0.287 0 ...
  ..$ avnode    : num [1:273, 1:100] 0 0 0 0 15.3 ...
  ..$ mbest     : int [1:273, 1:100] 14 10 2 8 0 2 8 0 2 0 ...

The result is a list with seven elements: a matrix of zero-indexed resamples, a matrix of predictions (ntest x ntree), a vector of predicted values, the covariate matrix, the response variable, the coordinates matrix, and the RF-GLS object.

Now, we can use the fitted model to predict the mean function (RFGLS_predict()) or the spatial response (RFGLS_predict_spatial()). The difference (as far as I understand) is that the former returns the mean prediction, while the latter uses the spatial coordinates in addition to the covariates to predict the spatial response.

The first function returns a list with two elements: a matrix of predictions (ntest x ntree) and a vector of predicted values, while the second function returns a list with just one element: a vector of predicted values. Just note that the predictions by the RFGLS_predict() are named "predicted" and the predictions by the RFGLS_predict_spatial() are named "prediction".

Show the code
prediction_result = RFGLS_predict(
  RFGLS_out = estimation_result,
  Xtest = covariate_matrix[-train_idx, ]
)
prediction_result_spatial = RFGLS_predict_spatial(
  RFGLS_out = estimation_result,
  coords.0 = coords[-train_idx, ],
  Xtest = covariate_matrix[-train_idx, ]
)
plot(prediction_result$predicted, prediction_result_spatial$prediction)

The final results of these two approaches are v. similar, but not identical.

Now, let’s predict the models’ results on the whole dataset.

Show the code
covariate_coords_r = terra::crds(covariates)
covariate_matrix_r = as.matrix(covariates)
covariate_matrix_r = covariate_matrix_r[, covariate_names]

pred_s = RFGLS_predict_spatial(
  RFGLS_out = estimation_result,
  coords.0 = covariate_coords_r,
  Xtest = covariate_matrix_r
)

pred_r = terra::setValues(covariates[[1]], pred_s$prediction)
names(pred_r) = "prediction"
terra::plot(pred_r)

2 spatialRF

The spatialRF (https://blasbenito.github.io/spatialRF/) package’s aim is to provide a minimal code interface to fit spatial regression models with Random Forest. The internal calculations are based on three general methods to generate spatial predictors from the distance matrix of the data points: Distance matrix columns as explanatory variables (Hengl et al. 2018), Moran’s Eigenvector Maps (Dray, Legendre, and Peres-Neto 2006) and PCAs. The ranger package is used here internally to fit the Random Forest model.

This package also requires the data to be in a specific format. We need to provide the data as a data frame with the dependent variable, including spatial coordinates, and the distance matrix: a matrix with the distances among the records in the data frame.

Show the code
library(spatialRF)
library(sf)

coordinates = st_coordinates(temperature)
colnames(coordinates) = c("x", "y")
coordinates = as.data.frame(coordinates)

temperature_df = st_drop_geometry(temperature)
temperature_df$x = coordinates[, 1]
temperature_df$y = coordinates[, 2]

distance_matrix = as.matrix(dist(temperature_df[2:(ncol(temperature_df) - 9)]))

We also need to define the dependent variable and the predictor variables.

Show the code
response_name = "temp"
covariate_names = colnames(temperature_df)[2:(ncol(temperature_df) - 9)]

Finally, we can fit the models using one of the methods provided by the package. The package has 10 methods implemented, nine of which are based on the three components:2

  1. The method to generate spatial predictors ("hengl", "mem", or "pca")
  2. The method to rank spatial predictors ("moran" or "effect")
  3. The method to select spatial predictors ("sequential" or "recursive")

The main function of this package is rf_spatial(), which fits the Random Forest model with spatial predictors. Here, an example using Moran’s Eigenvector Maps method to generate spatial predictors, Moran’s I to rank them, and sequential selection of the predictors is shown.

Show the code
rf_spatial_moran = rf_spatial(
  data = temperature_df,
  dependent.variable.name = response_name,
  predictor.variable.names = covariate_names,
  distance.matrix = distance_matrix,
  distance.thresholds = 0,
  method = "mem.moran.sequential",
  n.cores = 1
)
rf_spatial_moran
Model type
  - Fitted with:                     ranger()
  - Response variable:               temp

Random forest parameters
  - Type:                            Regression
  - Number of trees:                 500
  - Sample size:                     195
  - Number of predictors:            15
  - Mtry:                            3
  - Minimum node size:               5


Model performance 
  - R squared (oob):                  0.887361
  - R squared (cor(obs, pred)^2):     0.9844978
  - Pseudo R squared (cor(obs, pred)):0.9922186
  - RMSE (oob):                       0.9476691
  - RMSE:                             0.4131
  - Normalized RMSE:                  0.103203

Model residuals 
  - Stats: 
               ┌───────┬────────┬────────┬──────┬────────┬──────┐
               │ Min.  │ 1st Q. │ Median │ Mean │ 3rd Q. │ Max. │
               ├───────┼────────┼────────┼──────┼────────┼──────┤
               │ -1.87 │  -0.24 │   0.02 │ 0.00 │   0.26 │ 1.53 │
               └───────┴────────┴────────┴──────┴────────┴──────┘
  - Normality: 
      - Shapiro-Wilks W: 0.971 
      - p-value        : 4e-04 
      - Interpretation : Residuals are not normal 

  - Spatial autocorrelation: 
             ┌──────────┬───────────┬─────────┬──────────────────┐
             │ Distance │ Moran's I │ P value │ Interpretation   │
             ├──────────┼───────────┼─────────┼──────────────────┤
             │      0.0 │     0.005 │   0.258 │ No spatial       │
             │          │           │         │ correlation      │
             └──────────┴───────────┴─────────┴──────────────────┘

Variable importance: 
                         ┌──────────────┬────────────┐
                         │ Variable     │ Importance │
                         ├──────────────┼────────────┤
                         │ lst_night    │      1.966 │
                         ├──────────────┼────────────┤
                         │ lst_day      │      1.342 │
                         ├──────────────┼────────────┤
                         │ dem          │      1.245 │
                         ├──────────────┼────────────┤
                         │ CAMSpm25     │      0.936 │
                         ├──────────────┼────────────┤
                         │ coast        │      0.745 │
                         ├──────────────┼────────────┤
                         │ ndvi         │      0.683 │
                         ├──────────────┼────────────┤
                         │ imd          │      0.420 │
                         ├──────────────┼────────────┤
                         │ ntl          │      0.401 │
                         ├──────────────┼────────────┤
                         │ industry     │      0.291 │
                         ├──────────────┼────────────┤
                         │ popdens      │      0.270 │
                         ├──────────────┼────────────┤
                         │ urban        │      0.218 │
                         ├──────────────┼────────────┤
                         │ natural      │      0.212 │
                         ├──────────────┼────────────┤
                         │ agriculture  │      0.196 │
                         ├──────────────┼────────────┤
                         │ otherroads   │      0.181 │
                         ├──────────────┼────────────┤
                         │ primaryroads │      0.095 │
                         └──────────────┴────────────┘

The rf_spatial() returns a ranger model with several new slows, most importantly residuals that contain information about the residuals, and spatial that contains information about the selected spatial predictors and the method used to select them. Printing the model object provides a summary of the model, including its parameters, model performance, information on model residuals, and variable importance.

The spatialRF package also provides a set of additional functions. It includes a function for reducing multicollinearity in the predictors and removing redundant spatial predictors (filter_spatial_predictors()); or finding promising variable interactions (the_feature_engineer()):

Show the code
interactions = the_feature_engineer(
  data = temperature_df,
  dependent.variable.name = response_name,
  predictor.variable.names = covariate_names,
  xy = coordinates,
  importance.threshold = 0.50, #uses 50% best predictors
  cor.threshold = 0.60, #max corr between interactions and predictors
  seed = 2025-01-30,
  repetitions = 100,
  verbose = TRUE
)
 ┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐
 │ Interaction      │ Importance (% of │        R-squared │     Max cor with │
 │                  │             max) │      improvement │       predictors │
 ├──────────────────┼──────────────────┼──────────────────┼──────────────────┤
 │ dem..x..CAMSpm25 │             33.6 │            0.020 │            0.52  │
 ├──────────────────┼──────────────────┼──────────────────┼──────────────────┤
 │ dem..x..imd      │             22.1 │            0.014 │            0.54  │
 ├──────────────────┼──────────────────┼──────────────────┼──────────────────┤
 │ imd..pca..ntl    │             13.6 │            0.013 │            0.454 │
 └──────────────────┴──────────────────┴──────────────────┴──────────────────┘

The rf_evaluate() function allows the evaluation of the model using spatial cross-validation.

Show the code
rf_eval = rf_evaluate(
  model = rf_spatial_moran,
  xy = coordinates,
  repetitions = 30,
  training.fraction = 0.75,
  metrics = "rmse", 
  seed = 2025-01-30,
  verbose = TRUE
)

Spatial evaluation 
  - Training fraction:             0.75
  - Spatial folds:                 30

 Metric Median   MAD Minimum Maximum
   rmse  1.428 0.221   0.815   1.968
Show the code
rf_eval
Model type
  - Fitted with:                     ranger()
  - Response variable:               temp

Random forest parameters
  - Type:                            Regression
  - Number of trees:                 500
  - Sample size:                     195
  - Number of predictors:            15
  - Mtry:                            3
  - Minimum node size:               5


Model performance 
  - R squared (oob):                  0.887361
  - R squared (cor(obs, pred)^2):     0.9844978
  - Pseudo R squared (cor(obs, pred)):0.9922186
  - RMSE (oob):                       0.9476691
  - RMSE:                             0.4131
  - Normalized RMSE:                  0.103203

Model residuals 
  - Stats: 
               ┌───────┬────────┬────────┬──────┬────────┬──────┐
               │ Min.  │ 1st Q. │ Median │ Mean │ 3rd Q. │ Max. │
               ├───────┼────────┼────────┼──────┼────────┼──────┤
               │ -1.87 │  -0.24 │   0.02 │ 0.00 │   0.26 │ 1.53 │
               └───────┴────────┴────────┴──────┴────────┴──────┘
  - Normality: 
      - Shapiro-Wilks W: 0.971 
      - p-value        : 4e-04 
      - Interpretation : Residuals are not normal 

  - Spatial autocorrelation: 
             ┌──────────┬───────────┬─────────┬──────────────────┐
             │ Distance │ Moran's I │ P value │ Interpretation   │
             ├──────────┼───────────┼─────────┼──────────────────┤
             │      0.0 │     0.005 │   0.258 │ No spatial       │
             │          │           │         │ correlation      │
             └──────────┴───────────┴─────────┴──────────────────┘

Variable importance: 
                         ┌──────────────┬────────────┐
                         │ Variable     │ Importance │
                         ├──────────────┼────────────┤
                         │ lst_night    │      1.966 │
                         ├──────────────┼────────────┤
                         │ lst_day      │      1.342 │
                         ├──────────────┼────────────┤
                         │ dem          │      1.245 │
                         ├──────────────┼────────────┤
                         │ CAMSpm25     │      0.936 │
                         ├──────────────┼────────────┤
                         │ coast        │      0.745 │
                         ├──────────────┼────────────┤
                         │ ndvi         │      0.683 │
                         ├──────────────┼────────────┤
                         │ imd          │      0.420 │
                         ├──────────────┼────────────┤
                         │ ntl          │      0.401 │
                         ├──────────────┼────────────┤
                         │ industry     │      0.291 │
                         ├──────────────┼────────────┤
                         │ popdens      │      0.270 │
                         ├──────────────┼────────────┤
                         │ urban        │      0.218 │
                         ├──────────────┼────────────┤
                         │ natural      │      0.212 │
                         ├──────────────┼────────────┤
                         │ agriculture  │      0.196 │
                         ├──────────────┼────────────┤
                         │ otherroads   │      0.181 │
                         ├──────────────┼────────────┤
                         │ primaryroads │      0.095 │
                         └──────────────┴────────────┘

Spatial evaluation 
  - Training fraction:             0.75
  - Spatial folds:                 30

 Metric Median   MAD Minimum Maximum
   rmse  1.428 0.221   0.815   1.968

The rf_importance() function allows for visualizing the variable importance of the model.

Show the code
rf_imp = rf_importance(
 rf_spatial_moran,
  xy = coordinates
)

Show the code
rf_imp
Model type
  - Fitted with:                     ranger()
  - Response variable:               temp

Random forest parameters
  - Type:                            Regression
  - Number of trees:                 500
  - Sample size:                     195
  - Number of predictors:            15
  - Mtry:                            3
  - Minimum node size:               5


Model performance 
  - R squared (oob):                  0.887361
  - R squared (cor(obs, pred)^2):     0.9844978
  - Pseudo R squared (cor(obs, pred)):0.9922186
  - RMSE (oob):                       0.9476691
  - RMSE:                             0.4131
  - Normalized RMSE:                  0.103203

Model residuals 
  - Stats: 
               ┌───────┬────────┬────────┬──────┬────────┬──────┐
               │ Min.  │ 1st Q. │ Median │ Mean │ 3rd Q. │ Max. │
               ├───────┼────────┼────────┼──────┼────────┼──────┤
               │ -1.87 │  -0.24 │   0.02 │ 0.00 │   0.26 │ 1.53 │
               └───────┴────────┴────────┴──────┴────────┴──────┘
  - Normality: 
      - Shapiro-Wilks W: 0.971 
      - p-value        : 4e-04 
      - Interpretation : Residuals are not normal 

  - Spatial autocorrelation: 
             ┌──────────┬───────────┬─────────┬──────────────────┐
             │ Distance │ Moran's I │ P value │ Interpretation   │
             ├──────────┼───────────┼─────────┼──────────────────┤
             │      0.0 │     0.005 │   0.258 │ No spatial       │
             │          │           │         │ correlation      │
             └──────────┴───────────┴─────────┴──────────────────┘

Variable importance: 
           ┌────────────┬────────────┬─────────┬───────┬──────┬─────┐
           │ Variable   │ Importance │         │       │      │     │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ lst_night  │      1.966 │  0.13   │ 0.07  │ 15.5 │ 8.4 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ lst_day    │      1.342 │  0.0695 │ 0.035 │  8.3 │ 4.2 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ dem        │      1.245 │  0.045  │ 0.019 │  5.4 │ 2.3 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ CAMSpm25   │      0.936 │ -0.014  │ 0.021 │ -1.7 │ 2.5 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ coast      │      0.745 │ -0.01   │ 0.01  │ -1.2 │ 1.1 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ ndvi       │      0.683 │  0.0055 │ 0.013 │  0.7 │ 1.5 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ imd        │      0.420 │ -0.005  │ 0.004 │ -0.6 │ 0.5 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ ntl        │      0.401 │ -0.007  │ 0.007 │ -0.8 │ 0.8 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ industry   │      0.291 │  0      │ 0.004 │  0   │ 0.4 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ popdens    │      0.270 │ -0.003  │ 0.005 │ -0.4 │ 0.6 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ urban      │      0.218 │ -0.001  │ 0.006 │ -0.1 │ 0.7 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ natural    │      0.212 │ -0.007  │ 0.009 │ -0.8 │ 1.1 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ agricultur │      0.196 │  0      │ 0.008 │  0   │ 1   │
           │ e          │            │         │       │      │     │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ otherroads │      0.181 │ -0.004  │ 0.004 │ -0.5 │ 0.5 │
           ├────────────┼────────────┼─────────┼───────┼──────┼─────┤
           │ primaryroa │      0.095 │ -0.003  │ 0.004 │ -0.4 │ 0.5 │
           │ ds         │            │         │       │      │     │
           └────────────┴────────────┴─────────┴───────┴──────┴─────┘

Spatial evaluation 
  - Training fraction:             0.75
  - Spatial folds:                 30

    Metric Median   MAD Minimum Maximum
 r.squared  0.838 0.073   0.591   0.924

The mem() function generates Moran Eigenvector Maps (MEM) from a distance matrix.3

Show the code
mem1 = mem(distance.matrix = distance_matrix)

The package also contains a set of custom plot functions. One example is the plot_response_curves() function, which allows for the visualization of the response curves of the model.

Show the code
plot_response_curves(rf_spatial_moran)

Additional interesting functions allow for tuning the model parameters (rf_tuning()) or comparing several models (rf_compare()). A complete list of this package’s functions is available at https://blasbenito.github.io/spatialRF/reference/index.html.

The final prediction can be made using the predict() function from the terra package.

Show the code
pred_srf = terra::predict(covariates, rf_spatial_moran)
terra::plot(pred_srf[[1]])

3 sperrorest

The sperrorest (https://doi.org/10.32614/CRAN.package.sperrorest) package is designed for spatial error estimation and variable importance assessment for predictive models. The package itself does not fit the models but provides a set of functions for spatial cross-validation, including data partitioning and model cross-validation.

While the sperrorest package has many functions (including a set of functions for data partitioning), its main function is sperrorest(). It performs spatial cross-validation for spatial prediction models, including variable importance assessment and prediction error estimation. To use this function, we need to provide the formula, the data, the coordinates, the model function, the model arguments, the prediction function, the sampling function, and the sampling arguments.

Let’s do it step by step. First, we need to prepare the data by extracting the coordinates and creating a data frame with the dependent variable, covariates, and coordinates.

Show the code
library(sperrorest)
library(ranger)

coordinates = sf::st_coordinates(temperature)
temperature_df = sf::st_drop_geometry(temperature)
temperature_df$x = coordinates[, 1]
temperature_df$y = coordinates[, 2]

Second, we need to define the formula for the model and the prediction function.

Show the code
response_name = "temp"
covariate_names = colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
fo = as.formula(paste(response_name, "~", paste(covariate_names, collapse = " + ")))

Third, we need to define the custom prediction function. The sperrorest package works with many model functions, but it requires a custom prediction function to extract the predictions from the model object. In this example, we use the ranger model, so we need to define a custom prediction function that extracts the predictions from the ranger model object. The predict() function from the ranger package returns a list with several elements, so we need to extract the predictions from this list.4

Show the code
mypred = function(object, newdata) {
  predict(object, newdata)$predictions
}

Fourth, we can perform the spatial cross-validation using the sperrorest() function. We just need to provide previously prepared data, the formula, the model function, and the prediction function. Moreover, we can also define some additional parameters of the model, such as the number of trees in the ranger model. Finally, the important part is to define the sampling function (smp_fun) and its arguments (smp_args). The sampling function is used to partition the data into training and testing sets: here, we use the partition_kmeans() function to partition the data spatially into folds using k-means clustering of the coordinates.5

Show the code
# Spatial cross-validation
sp_res = sperrorest(
  formula = fo,
  data = temperature_df,
  coords = c("x", "y"),
  model_fun = ranger,
  model_args = list(num.trees = 100),
  pred_fun = mypred,
  smp_fun = partition_kmeans,
  smp_args = list(repetition = 1:2, nfold = 3),
  progress = FALSE
)

The result is a list with several components, including the error at the repetition and fold levels, the resampling object, the variable importance (only when importance = TRUE), the benchmark, and the package version.

Show the code
summary(sp_res$error_rep)
                     mean           sd       median          IQR
train_bias   5.503716e-03 0.0005710437 5.503716e-03 0.0004037888
train_stddev 4.005413e-01 0.0052912326 4.005413e-01 0.0037414665
train_rmse   4.000656e-01 0.0052760852 4.000656e-01 0.0037307557
train_mad    3.269942e-01 0.0160105606 3.269942e-01 0.0113211760
train_median 4.057216e-02 0.0075211112 4.057216e-02 0.0053182287
train_iqr    4.443592e-01 0.0054738763 4.443592e-01 0.0038706151
train_count  3.900000e+02 0.0000000000 3.900000e+02 0.0000000000
test_bias    6.902773e-02 0.0588864491 6.902773e-02 0.0416390075
test_stddev  1.459237e+00 0.0019335902 1.459237e+00 0.0013672547
test_rmse    1.457718e+00 0.0047141446 1.457718e+00 0.0033334036
test_mad     1.416826e+00 0.0454142596 1.416826e+00 0.0321127309
test_median  1.398933e-01 0.0951137382 1.398933e-01 0.0672555693
test_iqr     1.881005e+00 0.0871915974 1.881005e+00 0.0616537698
test_count   1.950000e+02 0.0000000000 1.950000e+02 0.0000000000

We can contrast the obtained results with the non-spatial cross-validation by changing the sampling function to partition_cv().

Show the code
# Non-spatial cross-validation
nsp_res = sperrorest(
  formula = fo,
  data = temperature_df,
  coords = c("x", "y"),
  model_fun = ranger,
  model_args = list(num.trees = 100),
  pred_fun = mypred,
  smp_fun = partition_cv,
  smp_args = list(repetition = 1:2, nfold = 3),
  progress = FALSE
)

To compare both results, we can plot the RMSE values for the training and testing sets of both spatial and non-spatial cross-validation.

Show the code
library(ggplot2)
# Extract train/test RMSE from spatial CV
sp_train_rmse = sp_res$error_rep$train_rmse
sp_test_rmse = sp_res$error_rep$test_rmse
# Extract train/test RMSE from non-spatial CV
nsp_train_rmse = nsp_res$error_rep$train_rmse
nsp_test_rmse = nsp_res$error_rep$test_rmse
# Build data frame
rmse_df = data.frame(
  CV_Type = rep(c("Spatial", "Non-Spatial"), each = 4),
  Set = rep(c("Train", "Test"), each = 2),
  RMSE = c(sp_train_rmse, sp_test_rmse, nsp_train_rmse, nsp_test_rmse)
)
ggplot(rmse_df, aes(x = CV_Type, y = RMSE, fill = Set)) +
  geom_boxplot() +
  facet_wrap(~Set) +
  labs(title = "RMSE Comparison", x = "CV Method", y = "RMSE")

The results show that the estimation using the spatial-cross validation is less optimistic than the non-spatial cross-validation for the test set.

More examples of the package use can be found at https://giscience-fsu.github.io/sperrorest/articles/spatial-modeling-use-case.html/

4 blockCV

The blockCV (https://doi.org/10.1111/2041-210X.13107) package provides a set of functions for block cross-validation, spatial and environmental clustering, and spatial autocorrelation estimation. The package itself does not fit the models.

Show the code
library(blockCV)

Cross-validation strategies separate the data into training and testing sets to evaluate the model’s performance. The blockCV package provides several cross-validation strategies, including block cross-validation, spatial clustering, environmental clustering, buffering LOO, and Nearest Neighbour Distance Matching (NNDM) LOO.

The block cross-validation is performed using the cv_spatial() function. It assigns blocks to the training and testing folds randomly, systematically or in a checkerboard pattern (the selection argument).

Show the code
sb1 = cv_spatial(x = temperature,
                  k = 10, # number of folds
                  size = 300000, # size of the blocks in meters
                  selection = "random", # random blocks-to-fold
                  iteration = 50, # find evenly dispersed folds
                  progress = FALSE,
                  biomod2 = TRUE)

   train test
1    181   14
2    171   24
3    171   24
4    175   20
5    169   26
6    165   30
7    180   15
8    182   13
9    182   13
10   179   16

The result is a list with several components, including the folds list, the folds IDs, the biomod table, the number of folds, the input size, the column name, the blocks, and the records. For example, we can check the structure of the folds list with the str() function.

Show the code
str(sb1$folds_list)
List of 10
 $ :List of 2
  ..$ : int [1:181] 33 87 99 121 104 117 97 103 100 105 ...
  ..$ : int [1:14] 148 139 142 160 158 147 143 149 146 161 ...
 $ :List of 2
  ..$ : int [1:171] 33 87 99 121 104 117 97 103 100 105 ...
  ..$ : int [1:24] 60 26 29 36 59 27 25 37 22 40 ...
 $ :List of 2
  ..$ : int [1:171] 33 87 99 121 104 117 97 103 100 105 ...
  ..$ : int [1:24] 17 176 11 14 16 169 15 12 44 46 ...
 $ :List of 2
  ..$ : int [1:175] 33 87 99 121 104 117 97 103 100 105 ...
  ..$ : int [1:20] 185 72 159 163 184 71 178 65 183 162 ...
 $ :List of 2
  ..$ : int [1:169] 33 87 99 121 104 117 97 103 100 105 ...
  ..$ : int [1:26] 114 91 140 154 106 88 141 113 155 111 ...
 $ :List of 2
  ..$ : int [1:165] 33 87 99 121 104 117 97 103 100 105 ...
  ..$ : int [1:30] 82 53 43 78 61 79 42 76 52 48 ...
 $ :List of 2
  ..$ : int [1:180] 33 87 99 121 104 117 97 103 100 105 ...
  ..$ : int [1:15] 133 138 134 136 192 195 174 173 172 187 ...
 $ :List of 2
  ..$ : int [1:182] 33 87 99 121 104 117 97 103 100 105 ...
  ..$ : int [1:13] 129 128 123 130 125 132 127 126 116 120 ...
 $ :List of 2
  ..$ : int [1:182] 87 99 121 104 117 97 103 100 105 84 ...
  ..$ : int [1:13] 33 193 2 4 8 194 1 10 7 6 ...
 $ :List of 2
  ..$ : int [1:179] 33 60 26 29 36 59 27 25 37 22 ...
  ..$ : int [1:16] 87 99 121 104 117 97 103 100 105 84 ...

The cv_plot() function additionally allows for the visualization of cross-validation results.

Show the code
cv_plot(sb1, temperature)

Let’s compare the results of the block cross-validation with systematic and checkerboard patterns.

Show the code
sb2 = cv_spatial(x = temperature,
                  k = 10,
                  rows_cols = c(4, 6),
                  hexagon = FALSE,
                  selection = "systematic")

   train test
1    172   23
2    180   15
3    177   18
4    169   26
5    178   17
6    182   13
7    180   15
8    162   33
9    178   17
10   177   18

Show the code
cv_plot(sb2, temperature)

Show the code
sb3 = cv_spatial(x = temperature,
                  k = 10,
                  size = 300000,
                  hexagon = FALSE,
                  selection = "checkerboard")

  train test
1    98   97
2    97   98

Show the code
cv_plot(sb3, temperature)

The clustering strategies (cv_cluster()) are used to group the data into clusters based on spatial or environmental similarity. The spatial similarity is based only on the clustering of the spatial coordinates.

Show the code
set.seed(6)
scv = cv_cluster(x = temperature, k = 10)
   train test
1    178   17
2    173   22
3    182   13
4    169   26
5    179   16
6    176   19
7    178   17
8    180   15
9    171   24
10   169   26
Show the code
cv_plot(scv, temperature)

The environmental clustering, on the other hand, is based on the clustering of the values of the covariates extracted from the raster data.

Show the code
set.seed(6)
ecv = cv_cluster(x = temperature, r = covariates, k = 5, scale = TRUE)
  train test
1    90  105
2   154   41
3   182   13
4   164   31
5   190    5
Show the code
cv_plot(ecv, temperature)

The next cross-validation strategy is buffering LOO (also known as Spatial LOO). It is performed using the cv_buffer() function, which selects a buffer around each point (test point) and uses the points outside the buffer as the testing set.6

Show the code
bloo = cv_buffer(x = temperature, size = 300000, progress = FALSE)
     train            test  
 Min.   : 97.0   Min.   :1  
 Mean   :132.7   Mean   :1  
 Max.   :170.0   Max.   :1  
Show the code
cv_plot(bloo, temperature, num_plots = c(1, 50, 100))

Note that above, we plot only the first, 50th, and 100th points to avoid overplotting.

The last cross-validation strategy implemented in the blockCV package is the Nearest Neighbour Distance Matching (NNDM) LOO. It is performed using the cv_nndm() function, which tries to match the nearest neighbor distance distribution function between the test and training data to the nearest neighbor distance distribution function between the target prediction and training points. Thus, in this base, we need to provide more arguments, including a raster with the covariates, the number of samples, the sampling strategy, and the minimum training size.

Show the code
nncv = cv_nndm(x = temperature,
                r = covariates,
                size = 300000,
                num_sample = 5000, 
                sampling = "regular",
                min_train = 0.1,
                plot = TRUE)
     train            test  
 Min.   :192.0   Min.   :1  
 Mean   :192.9   Mean   :1  
 Max.   :193.0   Max.   :1  

Show the code
cv_plot(nncv, temperature, num_plots = c(1, 50, 100))

Let’s now use the block cross-validation to fit and evaluate a model.

Show the code
# define formula
response_name = "temp"
covariate_names = colnames(temperature_df)[2:(ncol(temperature_df) - 7)]
fo = as.formula(paste(response_name, "~", paste(covariate_names, collapse = " + ")))

# extract the folds
folds = sb1$folds_list

model_rmse = data.frame(fold = seq_along(folds), rmse = rep(NA, length(folds)))

for(k in seq_along(folds)){
  trainSet = unlist(folds[[k]][1]) # training set indices; first element
  testSet = unlist(folds[[k]][2]) # testing set indices; second element
  rf = ranger(fo, temperature_df[trainSet, ], num.trees = 100) # model fitting on training set
  pred = predict(rf, temperature_df[testSet, ])$predictions # predict the test set
 model_rmse[k, "rmse"] = sqrt(mean((temperature_df[testSet, response_name] - pred)^2)) # calculate RMSE
}
model_rmse
   fold      rmse
1     1 0.5555220
2     2 1.1668933
3     3 0.8910963
4     4 0.8444776
5     5 0.8323935
6     6 1.3915134
7     7 0.8645923
8     8 0.8251324
9     9 1.1584016
10   10 1.0197603

The blockCV package also provides functions for checking the similarity between the folds (cv_similarity()) and estimating the effective range of spatial autocorrelation (cv_spatial_autocor()). The first function is used to check the similarity between the folds in the cross-validation.

Show the code
cv_similarity(cv = sb1, x = temperature, r = covariates, progress = FALSE)

The second function is used to estimate the effective range of spatial autocorrelation of all input raster layers or the response data – its role is to help to determine the size of the blocks in the block cross-validation.

Show the code
cv_spatial_autocor(r = covariates, num_sample = 5000, progress = FALSE)

[1] "cv_spatial_autocor"

More examples of the package’s use can be found at https://cran.r-project.org/web/packages/blockCV/vignettes/tutorial_2.html.

5 ENMeval

https://jamiemkass.github.io/ENMeval/articles/ENMeval-2.0-vignette.html

Out-of-scope: a focus on the occurrence records (ecological niche models/species distribution models)

6 sits

https://github.com/e-sensing/sits

Out-of-scope: a focus on spatiotemporal data cubes

Footnotes

  1. Quoting the authors “RF-GLS extends RF in the same way generalized least squares (GLS) fundamentally extends ordinary least squares (OLS) to accommodate for dependence in linear models.”↩︎

  2. See ?rf_spatial for more details. Also, the 10th method is "hengl" directly following the approach by Hengl et al. (2018).↩︎

  3. mem_multithreshold() function allows for generating MEMs for multiple distance thresholds.↩︎

  4. More information on the custom prediction functions is at https://cran.r-project.org/web/packages/sperrorest/vignettes/custom-pred-and-model-functions.html.↩︎

  5. There are several other partition functions available in the package, including partition_disc(), partition_tiles(), and partition_cv().↩︎

  6. This approach is a form of leave-one-out cross-validation.↩︎